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Abstract We consider a class of inverse problems in which the forward model 
is the solution operator to linear ODEs or PDEs. This class admits several di- 
mensionality-reduction techniques based on data averaging or sampling, which are 
especially useful for large-scale problems. We survey these approaches and their 
connection to stochastic optimization. The data-averaging approach is only viable, 
however, for a least-squares misfit, which is sensitive to outliers in the data and 
artifacts unexplained by the forward model. This motivates us to propose a robust 
formulation based on the Student's t-distribution of the error. We demonstrate 
how the corresponding penalty function, together with the sampling approach, can 
obtain good results for a large-scale seismic inverse problem with 50% corrupted 
data. 

Keywords inverse problems • seismic inversion • stochastic optimization • robust 
estimation 

1 Introduction 

Consider the generic parameter-estimation scheme in which we conduct m exper- 
iments, recording the corresponding experimental input vectors {91,92, • • • ,1m) 
and observation vectors {di, cfe, • • • , dm}- We model the data for given parameters 
x e R" by 

d t = F l (x)q l + et for i = l,...,m, (1.1) 

This work was in part financially supportcd by the Naturai Sciences and Engineering Research 
Council of Canada Discovery Grant (22R81254) and the Collaborative Research and Develop- 
mcnt Grant DNOISE II (375142-08). This research was carried out as part of the SINBAD II 
project with support from the following organizations: BG Group, BPG, BP, Chevron, Conoco 
Phillips, Pctrobras, PGS, Total SA, and WesternGeco. 

A. Aravkin, F. J. Herrmann, and T. van Leeuwen 

Dept. of Earth and Ocean Sciences, University of British Columbia, Vancouver, BC, Canada 
E-mail: {saravkm,fhcrrmann,tleeuwen}@eos.ubc.ca 

M. P. Friedlander 

Dept. of Computer Science, University of British Columbia, Vancouver, BC, Canada 
E-mail: mpf@cs.ubc.ca 



2 



A. Aravkin, M. P. Friedlander, F. J. Herrmann, and T. vari Leeuwen 



where observation di is obtained by the linear action of the forward model Fi (x) 
on known source parameters qi, and independent errors ti capture the discrepancy 
between di and prediction Fi(x)qi. The class of models captured by this repre- 
sentation includes solution operators to any linear (partial) differential equation 
with boundary conditions, where the qi are the right-hand sides of the equations. 
A special case arises when Fi = F, Le., the forward model is the same for each 
experiment. 

Inverse problems based on these forward models arise in a variety of applications, 
including medicai imaging and seismic exploration, in which the parameters x 
usually represent particular physical properties of a material. We are particularly 
motivated by the full-waveform inversion (FWI) application in seismology, which is 
used to image the earth's subsurface [38]. In full-waveform inversion, the forward 
model F is the solution operator of the wave equation composed with a restriction 
of the full solution to the observation points (receivers) ; x represents sound- velo city 
parameters for a (spatial) 2- or 3-dimensional mesh; the vectors qi encode the 
location and signature of the zth source experiment; and the vectors di contain 
the corresponding measurements at each receiver. A typical survey in exploration 
seismology may contain thousands of experiments (shots), and global seismology 
relies on naturai experiments provided by measuring thousands of earthquakes 
detected at seismic stations around the world. Standard data-fitting algorithms may 
require months of CPU time on large computing clusters to process this volume of 
data and yield coherent geological information. 

Inverse problems based on the forward models that satisfy (1.1) are typically 
solved by minimizing some measure of misfit, and have the general form 

1 m 

minimize 0(aO := — / <pi{x), (1-2) 
i=l 

where each <pi(x) is some measure of the residuai 

n(x) := di - Fi{x) qi (1.3) 

between the observation and prediction of the zth experiment. The classical approach 
is based on the least-squares penalty 

4>i(x) = \\n(x)\\ 2 . (1.4) 

This choice can be interpreted as finding the maximum likelihood (ML) estimate of 
x, given the assumptions that the errors ti are independent and follow a Gaussian 
distribution. 

Formulation (1.2) is general enough to capture a variety of models, including 
many familiar examples. If the di and qi are scalars, and the forward model is 
linear, then standard least-squares 

4>i{x) = \{ajx - di) 2 

easily fits into our general formulation. More generally, ML estimation is based on 
the form 

(f>i(x) = -\ogp i (r l (x)), 
where pi is a particular probability density function of ti. 
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1.1 Dimensionality reduction 

Full-waveform inversion is a prime example of an application in which the cost 
of evaluating each element in the sum of <p is very costly: every residuai vector 
ri (a:) — required to evaluate one element in the sum of (1.2) — entails solving a 
partial differential equation on a 2D or 3D mesh with thousands of grid points in 
each dimension. The scale of such problems is a motivation for using dimensionality 
reduction techniques that address small portions of the data at a time. 

The least-squares objective (1.4) allows for a powerful form of data aggregation 
that is based on randomly fusing groups of experiments into "meta" experiments, 
with the effect of reducing the overall problem size. The aggregation scheme is based 
on Haber et al.'s [17] observation that for this choice of penalty, the objective is 
connected to the trace of a residuai matrix. That is, we can represent the objective 
of (1.2) by 

1 m 1 

4>{x) = —Y, IM^H 2 = - trace (R(x) T R(x)) , (1.5) 

i=l 

where 

R(x) := [ri(x),r 2 (x), . . . , r m {x)) 

collects the residuai vectors (1.3). Now consider a small sample of s weighted 
averages of the data, Le., 

m m 

dj = }Wijdi and = } j mjqi, 3 = 1, 

i=l i=l 

where s <C m and Wij are random variables, and collect the corresponding s 
residuals 7j(x) = dj — Fj(x)cfj into the matrix R w (x) := [Fi(x),r2(a;), . . . ,r s (x)]. 
Because the residuals are linear in the data, we can write compactly 

R w (x) := R(x)W where W := (w iS ). 

Thus, we may consider the sample function 

1 s 1 

4>w{x) = ) j \\ r j( x )\\ 2 = - tra ce {R w {x) T R w (x)) (1.6) 

S 3=1 S 

based on the s averaged residuals. Proposition 1.1 then follows directly from 
Hutchinson's [22, §2] work on stochastic trace estimation. 

Proposition 1.1. IfE[WW T ] = I, then 

R[<f> w (x)] = 4>(x) and E[V<f> w (x)} = V<j>{x). 

Hutchinson proves that if the weights w%j are drawn independently from a 
Rademacher distribution, which takes the values ±1 with equal probability, then 
the stochastic-trace estimate has minimum variance. Avron and Toledo [4] compare 
the quality of stochastic estimators obtained from other distributions. Golub and 
von Matt [15] report the surprising result that the estimate obtained with even a 
single sample (s = 1) is often of high quality. Experiments that use the approach 
in FWI give evidence that good estimates of the true parameters can be obtained 
at a fraction of the computational cost required by the full approach [19,24,41]. 
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1.2 Approach 

Although the least-squares approach enjoys widespread use, and naturally accom- 
modates the dimensionality-reduction technique just described, it is known to be 
unsuitable for non-Gaussian errors, especially for cases with very noisy or corrupted 
data often encountered in practice. The least-squares formulation also breaks down 
in the face of systematic features of the data that are unexplained by the model Fi. 

Our aim is to characterize the benefits of robust inversion and to describe 
randomized sampling schemes and optimization algorithms suitable for large-scale 
applications in which even a single evaluation of the forward model and its action 
on qi is computationally expensive. (In practice, the product Fi{x)qi is evaluated 
as a single unit.) We interpret these sampling schemes, which include the well- 
known incremental-gradient algorithm [28], as dimensionality-reduction techniques, 
because they allow algorithms to make progress using only a portion of the data. 

This paper is organized into the following components: 

Robust statistics (§2). We survey robust approaches from a statistical perspective, 
and present a robust approach based on the heavy-tailed Student's t-distribution. 
We show that ali log-concave error models share statistical properties that differ- 
entiate them from heavy-tailed densities (such as the Student's t) and limit their 
ability to work in regimes with large outliers or significant systematic corruption 
of the data. We demonstrate that densities outside the log-concave family allow 
extremely robust formulations that yield reasonable inversion results even in the 
face of major data contamination. 

Sample average approximations (§3). We propose a dimensionality-reduction 
technique based on sampling the available data, and characterize the statistical 
properties that make it suitable as the basis for an optimization algorithm to solve 
the general inversion problem (1.2). These techniques can be used for the general 
robust formulation described in §2, and for formulations in which forward models 
Fi vary with i. 

Stochastic optimization (§4) We review stochastic-gradient, randomized in- 
cremental-gradient, and sample-average methods. We show how the assumptions 
required by each method fit with the class of inverse problems of interest, and can 
be satisfìed by the sampling schemes discussed in §3. 

Seismic inversion (§5) We test the proposed sample-average approach on the 
robust formulation of the FWI problem. We compare the inversion results obtained 
with the new heavy-tailed approach to those obtained using robust log-concave 
models and conventional methods, and demonstrate that a useful synthetic velocity 
model can be recovered by the heavy-tailed robust method in an extreme case with 
50% missing data. We also compare the performance of stochastic algorithms and 
deterministic approaches, and show that the robust result can be obtained using 
only 30% of the effort required by a deterministic approach. 

2 Robust Statistics 

A popular approach in robust regression is to replace the least-squares penalty (1.4) 
on the residuai with a penalty that increases more slowly than the 2-norm. (Virieux 
and Operto [42] discuss the difficulties with least-squares regression, which are 
especially egregious in seismic inversion.) 
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One way to derive a robust approach of this form is to assume that the noise 
sì comes from a particular non-Gaussian probability density, pi, and then find 
the maximum likelihood (ML) estimate of the parameters x that maximizes the 
likelihood that the residuai vectors ri(x) are realizations of the random variable e», 
given the observations di. Because the negative logarithm is monotone decreasing, 
it is naturai to minimize the negative log of the likelihood function rather than 
maximizing the likelihood itself. In fact, when the distribution of the errors ej is 
modeled using a log-concave density 

p(r) oc exp ( - p(r)), 

with a convex loss function p, the ML estimation problem is equivalent to the 
formulation (1.2), with 

<j)i{x) = p(n(x)) for i=l,...,m. (2.1) 

One could also simply start with a penalty p on r%{x), without explicitly 
modelling the noise density; estimates obtained this way are generally known 
as M-estimates [20]. A popular choice that follows this approach is the Huber 
penalty [20,21,27]. 

Robust formulations are typically based on convex penalties p — or equivalently, 
on log-concave densities for gj — that look quadratic near and increase linearly far 
from zero. Robust penalties, including the 1-norm and Huber, for electromagnetic 
inverse problems are discussed by Farquaharson and Oldenburg in [13]. Guitton 
and Symes [16] consider the Huber penalty in the seismic context, and they cite 
many previous examples of the use of 1-norm penalty in geophysics. Huber and 
1-norm penalties are further compared on large-scale seismic problems by Brossier 
et al. [7], and a Huber-like (but strictly convex) hyperbolic penalty is described by 
Bube and Nemeth [9], with the aim of avoiding possible non-uniqueness associated 
with the Huber penalty. 

Clearly, practitioners have a preference for convex formulations. However, it is 
important to note that 

— for nonlinear forward models Fi, the optimization problem (1.2) is typically 
nonconvex even for convex penalties p (it is difhcult to satisfy the compositional 
requirements for convexity in that case); 

— even for linear forward models Fi, it may be beneficiai to choose a nonconvex 
penalty in order to guard against outliers in the data. 

We will justify the second point from a statistical perspective. Before we proceed 
with the argument, we introduce the Student's t-density, which we use in designing 
our robust method for FWI. 



2.1 Heavy-tailed distribution: Student's t 

Robust formulations using the Student's t-distribution have been shown to out- 
perform log-concave formulations in various applications [1]. In this section, we 
introduce the Student's t-density, explain its properties, and establish a result 
that underscores how different heavy-tailed distributions are from those in the 
log-concave family. 
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The scalar Student's t-density function with mean /i and positive degrees-of- 
freedom parameter v is given by 

p(r |M,^)oc (l + (r- M ) 2 /^)" (1+I/)/2 . (2.2) 

The density is depicted in Figure l(a). The parameter v can be understood by 
recalling the origins of the Student's t-distribution. Given n i.i.d. Gaussian variables 
xì with mean fi, the normalized sample mean 

(2.3) 



S/y/n 

follows the Student's t-distribution with v = n — 1, where the sample variance 
S = j- y^fxj — a;) is distributed as a \ random variable with n — 1 degrees 
of freedom. As v — » oo, the characterization (2.3) immediately implies that the 
Student's t-density converges pointwise to the density of N(0, 1). Thus, v can be 
interpreted as a tuning parameter: for low values one expects a high degree of 
non-normality, but as v increases, the distribution behaves more like a Gaussian 
distribution. This interpretation is highlighted in [25]. 

For a zero-mean Student's t-distribution (/i = 0), the log-likelihood of the 
density (2.2) gives rise to the nonconvex penalty function 

p(r) =log(l + r 2 H, (2.4) 

which is depicted in Figure l(b). The nonconvexity of this penalty is equivalent to 
the sub-exponential decrease of the tail of the Student's t-distribution, which goes 
to at the rate l/r v+1 as r —} oo. 

The significance of these so-called heavy tails in outlier removal becomes clear 
when we consider the following questioni Given that a scalar residuai deviates from 
the mean by more than t, what is the probability that it actually deviates by more 
than 2t? 

The 1-norm is the slowest-growing convex penalty, and is induced by the Laplace 
distribution, which is proportional to exp(— 1| • ||i). A basic property of the scalar 
Laplace distribution is that it is memory free. That is, given a Laplace distribution 
with mean 1/a, then the probability relationship 

Pr(H > t 2 | |r| > ti) = Pr(|r| > t 2 - ti) = exp(-a[£ 2 - ti]) (2.5) 

holds for ali Ì2 > ti. Hence, the probability that a scalar residuai is at least 2t away 
from the mean, given that it is at least t away from the mean, decays exponentially 
fast with t. For large t, it is unintuitive to make such a strong claim for a residuai 
aheady known to correspond to an outlier. 

Contrast this behavior with that of the Student's t-distribution. When v = 1, 
the Student's t-distribution is simply the Cauchy distribution, with a density 
proportional to 1/(1 + r 2 ). Then we have that 

? - arctan(2£) 1 

lim Pr( r > 2t \r\ > t) = lim \ y —^- = -. 

t-»oo t^oo e _ arctan(t) 2 



Remarkably, the conditional probability is independent of t for large residuals. This 
cannot be achieved with any probability density arising from a convex penalty, 
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because (2.5) provides a lower bound for this family of densities, as is shown in 
the following theorem. 



Theorem 2.1. Consider any scalar density p arising frora a symmetric proper 
closed convex penalty p via p(t) = exp(— p(t)), and take any point to with positive 
righi derivative oto = d+p(to) > 0. Then for ali t 2 > t\ > to, the conditional tail 
distribution induced by p(r) satisfies 

Pr(H > t 2 | \r\ > ti) < exp(-a [Ì2 - il]) . 



Proof. Define £(t) = p(ti)+ai(t— ti) tobethe (global) linear under-estimate for p at 
ti, where ai = d+p{t\) is the right derivative of p at ti. Define F(t) = p(r)dr. 
We first note that F(t) is log-concave (apply [33, Theorem 3], taking the set 
A = {z | z > 0}). Then \og(F(t)) is concave, and so its derivative 

log(F(i))' = 

is non-increasing. Therefore, the ratio p(t)/F(t) is nondecreasing, and in particular 

p(h) , p(* 2 ) . , F(t 2 ) P (t 2 ) 

— -. — V < 7 . , or equivalently, — j — f < —. — - . 
F(ti) ~ F(t 2 ) ' 4 y ' F(ti) ~ p(ti) 

By assumption on the functions i and p, 

p{t 2 )-l{t 2 ) > p(ti)-£(ti) = 0, 

which implies that 

Prflrl > t 2 I Irl > ti) = ^ < exp( ~ p(t2)) 
11 1 > 2 1 1 1 > lj F(ti) S exp(-p(t 1 )) 

= exp(-[p(t 2 )-£(t 1 )]) 

< exp(-[t(t 2 ) -£(ti)]) 

= exp(— ai[*2 - ti}) . 

To complete the proof, note that the right derivative d+p(t) is nondecreasing [34, 
Theorem 24.1]. Then we have ao < ai for to < ti. □ 

For differentiable log-concave densities, the influence function is defined to be 
p'(t), and for a general distribution it is the derivative of the negative log of the 
density. These functions provide further insight into the difference between the 
behaviors of log-concave densities and heavy-tailed densities such as the Student's. 
In particular, they measure the effect of the size of a residuai on the negative log 
likelihood. The Student's t-density has a so-called redescending influence function: 
as residuals grow larger, they are effectively ignored by the model. Figure 1 shows 
the relationships among densities, penalties, and influence functions of two log- 
concave distributions (Gaussian and Laplacian) and those of the Student's t, which 
is not log-concave. If we examine the derivative 
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Fig. 1: The Gaussian ( — ), Laplace ( — ), and Student's t- ( — ) distributions: (a) 
densities, (b) penalties, and (c) influence functions. 

of the Student's t-penalty (2.4), it is clear that large residuate have a small influence 
when r 2 3> v. For small r, on the other hand, the derivative resembles that of the 
least-squares penalty. See Hampel et al. [18] for a discussion of influence-function 
approaches to robust statistics, and redescending influence functions in particular, 
and Shevlyakov et al. [35] for further connections. 

There is an implicit tradeoff between convex and non-convex penalties (and 
their log-concave and non- log-concave counterparts) . Convex models are easier to 
characterize and solve, but may be wrong in a situation in which large outliers are 
expected. Nonconvex penalties are particularly useful with large outliers. 



2.2 The Student's t in practice 

Figure 2 compares the reconstruction obtained using the Student's t-penalty, with 
those obtained using least-squares and Huber penalties, on an FWI experiment 
(described more fully in §5). These panels show histograms of the residuals (1.3) that 
are obtained at different solutions, including the true solution, and the solutions 
recovered by solving (1.2) where the subfunctions (f>i in (2.1) are defined by the 
least-squares, Huber, and Student's t- penalties. 

The experiment simulates 50% missing data using a random mask that zeros 
out half of the data obtained via a forward model at the true value of x. A residuai 
histogram at the true x therefore contains a large spike at 0, corresponding to 
the residuals for correct data, and a multimodal distribution of residuals for the 
erased data. The least-squares recovery yields a residuai histogram that resembles 
a Gaussian distribution. The corresponding inversion result is useless, which is not 
surprising, because the residuals at the true solution are very far from Guassian. 
The reconstruction using the Huber penalty is a significant improvement over the 
conventional least-squares approach, and the residuai has a shape that resembles 
the Laplace distribution, which is closer to the shape of the true residuai. The 
Student's t approach yields the best reconstruction, and, remarkably, produces 
a residuai distribution that matches the multi-modal shape of the true residuai 
histogram. This is surprising because the Student's t-distribution is unimodal, but 
the residuai shape obtained using the inversion formulation is not. It appears that 
the statistical prior implied by the Student's t-distribution is weak enough to allow 
the model to converge to a solution that is almost fully consistent with the good 
data, and completely ignors the bad data. 
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(a) Trae model residuai and solution 





-3 3 5 10 15 20 25 30 

(d) Student's t residuai and solution 



Fig. 2: Residuai histograms (normalized) and solutions for an FWI problem. The 
histogram at (a) the true solution shows that the errors follow a tri-modal distri- 
bution (superimposed on the other histogram panels for reference). The residuals 
for (b) least-squares and (c) Huber reconstructions follow the model error densities 
(Le., Gaussian and Laplace). The residuals for (d) the Student t reconstruction, 
however, closely match the distribution of the actual errors. 
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Despite several successful applications in statistics and control theory [12,25], 
Student's t-formulations do not enjoy widespread use, especially in the context 
of nonlinear regression and large-scale inverse problems. Recently, however, they 
were shown to work very well for robust recovery in nonlinear inverse problems 
such as Kalman smoothing and bundle adjustment [1], and to outperform the 
Huber penalty when inverting large synthetic models [2,3]. Moreover, because the 
corresponding penalty function is smooth, it is usually possible to adapt existing 
algorithms and workflows to work with a robust formulation. 

In order for algorithms to be useful with industrial-scale problems, it is essential 
that they be designed for conventional and robust formulations that use a relatively 
small portion of the data in any computational kernel. We lay the groundwork for 
these algorithms in the next section. 

3 Sample average approximations 

The data-averaging approach used to derive the approximation (1.6) may not 
be appropriate when the misfìt functions fa are something other than the 2- 
norm. In particular, a result such as Proposition 1.1, which reassures us that the 
approximations are unbiased estimates of the true functions, relies on the special 
structure of the 2-norm, and is not available to us in the more general case. In 
this section, we describe sampling strategies — analogous to the stochastic-trace 
estimation procedure of §1.1 — that allow for more general misfit measures fa. In 
particular, we are interested in a sampling approach that allows for differential 
treatment across experiments i, and for robust functions. 

We adopt the useful perspective that each of the constituent functions fa and 
the gradients X7fa are members of a fixed population of size m. The aggregate 
objective function and its gradient, 

^ 7TL 1 m 

fax) = — }^fa(x) and Vfax) = — } j Vfa(x), 
m i=i m i=l 

can then simply be considered to be population averages of the individuai objectives 
and gradients, as reflected in the scaling factors 1/m. A common method for esti- 
mating the mean of a population is to sample only a small subset S C { 1, . . . , m } 
to derive the sample averages 

fa(x) = -J2M X ) and Vfa(x) = Vfa{x), (3.1) 

s ies s ies 

where s = \S\ is the sample size. We build the subset S as a uniform random 
sampling of the full population, and in that case the sample averages are unbiased: 

E[fa(x)} = fax) and K[Vfa(x)] = Vfax). (3.2) 

The cost of evaluating these sample-average approximations is about s/m times 
that for the true function and gradient. (Non-uniform schemes, such as importance 
and stratifìed sampling, are also possible, but require prior knowledge about the 
relative importance of the fa.) We use these quantities to drive the optimization 
procedure. 
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This approach constitutes a kind of dimensionality-reduction scheme, and it 
is widely used by census takers to avoid the expense of measuring the entire 
population. In our case, measuring each element of the population means an 
evaluation of a function <f>i and its gradient V(/>i . The goal of probability sampling 
is to design randomized sampling schemes that estimate statistics — such as these 
sample averages — with quantifiable error; see, for example, Lohr's introductory 
text [26]. 

The stochastic-optimization methods that we describe in §4 allow for approx- 
imate gradients, and thus can take advantage of these sampling schemes. The 
error analysis of the sample-average method described in §4.3 relies on the second 
moment of the error 

e = V^s - V<^> (3.3) 

in the gradient. Because the sample averages are unbiased, the expected value of 
the squared error of the approximation reduces to the variance of the norm of the 
sample average: 

E[||e|| a ]=V[||V0 a ||]. (3.4) 

This error is key to the optimization process, because the accuracy of the gradient 
estimate ultimately determines the quality of the search directions available to the 
underlying optimization algorithm. 



3.1 Sampling with and without replacement 

Intuitively, the size s of the random sample influences the norm of the error e 
in the gradient estimate. The difference between uniform sampling schemes with 
or without replacement greatly affects how the variance of the sample average 
decreases as the sample size increases. In both cases, the variance of the estimator 
is proportional to the sample variance 

1 m 

i=l 

of the population of gradients { V<£i, . . . , V^ m } evaluated at x. This quantity is 
inherent to the problem and independent of the chosen sampling scheme. 

When sampling from a finite population without replacement (i.e., every element 
in S occurs only once), then the Grror Gn of the sample average gradient satisfies 

E[IM| 2 ] = ^(l-£K; (3.6) 

for example, see Cochran [11] or Lohr [26, §2.7]. Note that the expected error 
decreases with s, and — importantly — is exactly when s = m. On the other hand, 
in a sample average gradient built by uniform sampling with replacement, every 
sample draw of the population is independent of the others, so that the error e r of 
this sample average gradient satisfies 

E[||e r || 2 ] = l -a 2 g . (3.7) 

This error goes to as 1/s, and is never when sampling over a finite population. 
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Comparing the expected error between sampling with and without replacement 
for finite populations, we note that 



and so sampling without replacement yields a uniformly lower expected error than 
independent finite sampling. 



3.2 Data averaging 

The data-averaging approach discussed in §1.1 for the objective (1.5) does not 
immediately fit into the sample-average framework just presented, even though the 
function <p w defined in (1.6) is a sample average. Nevertheless, for ali sampling 
schemes described by Proposition 1.1, the sample average 



is in some sense a sample average of an infinite population. If the random vectors 
are uncorrelated — as required by Proposition 1.1 — than, as with (3.7), the error 



of the sample average gradient is proportional to the sample variance of the 
population of gradients of cj>w That is, 



where is the sample variance of the population of gradients { V</>i, . . . , V(èm }. 

The particular value of <7g will depend on the distribution from which the 
weights wì are drawn; for some distributions of Wi this quantity may even be 
infinite, as is shown by the following results. 

The sample variance (3.5) is always finite, and the analogous sample variance 
<7g of the implicit functions V^>i is finite under general conditions on tu. 



Proposition 3.1. The sample variance <jg of the population { V0i, ■ ■ ■ , V^m } 
of gradients is finite when the distribution for w% has finite fourth moments. 





V<t> w 
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Proof. The claim follows from a few simple bounds (ali sums run from 1 to m): 



? 2 <E HV^II 2 



4E 



< 4E 



< 4E 



4E 



^ WiVri(x) 



\\w'Vri(x)\\ 2 '^2 \\win(x 

i i 

^KlIlVn^H^KIIIn^ 



< 4maxm ||Vr»(a;)||2 • max ||r,(a;)|| E 



E 



2 2 



The quantity E[5^ ■ wfw'j] < oo when the fourth moments are finite. 



□ 



As long as a g is nonzero, the expected error of uniform sampling without 
replacement is asymptotically better than the expected error that results from data 
averaging. That is, 



E[||e„|| 2 ] < E[||e,„|| 2 ] for ali s large enough. 



At least as measured by the second moment of the error in the gradient, the simple 
random sampling without replacement has the benefit of yielding a good estimate 
when compared to these other sampling schemes. 



4 Stochastic optimization 

Stochastic optimization, which naturally allows for inexact gradient calculations, 
meshes well with the various sampling and averaging strategies described in §3. 
We review several approaches that fall under the stochastic optimization umbrella, 
and describe their relative benefits. 

Although the full-waveform inversion application that we consider is nonconvex, 
the following discussion make the assumption that the optimization problem 
is convex; this expedient concedes the analytical tools that allow us to connect 
sampling with rates of convergence. It is otherwise difficult to connect a convergence 
rate to the sample size; see [14, §2.3]. The usefulness of the approach is justified by 
numerical experiments, both in the present paper and in [14, §5], where results for 
both convex and nonconvex models are presented. 
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4.1 Stochastic gradient methods 

Stochastic gradient methods for minimizing a differentiable function (j), not neces- 
sarily of the forni defined in (1.2), can be generically expressed by the iteration 

x k +i = x k - a k d k with d k := s k + e*,, (4.1) 

where a k is a positive stepsize, s k is a descent direction for <f>, and e k is a random 
noise terni. Bertsekas and Tsitsiklis [6, Prop. 3] give general conditions under which 

lini V<t>(xk) = 0, 

k— >oo 

and every limit point of {xk} is a stationary point of <f>. Note that unless the 
minimizer is unique, this does not imply that the sequence of iterates {x k } converges. 
Chief among the required conditions are that is globally Lipschitz, i.e., for 
some positive L, 

\\V(j)(x) - V(j>(y)\\ < L\\x - y\\ for ali x and y; 

that for ali fc, 

slV^Xk) < - m ||V<Mx fc )|| 2 , (4.2a) 

\\s k \\ < M2 (l + ||V0(x fc )||), (4.2b) 

E[e fe ] =0 and E[||e fc || 2 ] < fi 3 , (4.2c) 

for some positive constants fxi, fi2, and fj,3; and that the steplengths satisfy the 
infinite travel and summable conditions 

oo oo 

«fc = oo and a% < oo. (4-3) 

fc=Q fe=0 

Many authors have worked on similar stochastic-gradient methods, but the Bert- 
sekas and Tsitsiklis [6] is particularly general; see their paper for further references. 

Note that the randomized sample average schemes (with or without replacement) 
from §3 can be immediately used to design a stochastic gradient that satisfies (4.2b). 
It suffices to choose the sample average of the gradient (3.1) as the search direction: 

d fc = \7(j) s (x k ). 

Because the sample average V0 S is unbiased — cf. (3.2) — this direction is on average 
simply the steepest descent, and can be interpreted as having been generated from 
the choices 

s k =\7<j)(x k ) and e k = V 4> s {x k ) - V (f)(x k ). 

Moreover, the sample average has finite variance — cf. (3.6)-(3.7) — and so the 
direction s k and the error e k clearly satisfy conditions (4.2). 

The same argument holds for the data-averaging scheme outlined in §1.1, as 
long as the distribution of the mixing vector admits an unbiased sample average 
with a finite variance. Propositions 1.1 and 3.1 establish conditions under which 
these requirements hold. 
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Suppose that <fi is strongly convex with parameter fi, which implies that 



-\\xk ~ x*\\ 2 < (j){x k ) - 4>(x*), 



where x* is the unique minimizer of (j>. Under this additional assumption, further 
statements can be made about the rate of convergence. In particular, the iteration 
(4.1), with Sfe = X7<f>(xk), converges sublinearly, Le., 



where the steplengths a k = C(l/fc) are decreasing [30, §2.1]. This is in fact the 
optimal rate among ali first-order stochastic methods [29, §14.1]. 

A strength of the stochastic algorithm (4.1) is that it applies so generally. AH 
of the sampling approaches that we have discussed so far, and no doubt others, 
easily fit into this framework. The convergence guarantees are relatively weak for 
our purposes, however, because they do not provide guidance on how a sampling 
strategy might influence the speed of convergence. This analysis is cruciai within 
the context of the sampling schemes that we consider, because we want to gain an 
understanding of how the sample size influences the speed of the algorithm. 



4.2 Incremental-gradient methods 

Incremental-gradient methods, in their randomized form, can be considered a 
special case of stochastic gradient methods that are especially suited to optimizing 
sums of functions such as (1.2). They can be described by the iteration scheme 



for some positive steplengths a.k, where the index selects among the m constituent 
functions of <j>. In the deterministic version of the algorithm, the ordering of the 
subfunctions (pi is predetermined, and the counter i% = (fc mod m) + 1 makes a 
full sweep through ali the functions every m iterations. In the randomized version, 
ik is at each iteration randomly selected with equal probability from the indices 
1, ... ,m. (The Kaczmarz method for linear system [23] is closely related, and a 
randomized version of it is analyzed by Strohmer and Vershynin [37].) 

In the context of the sampling discussion in §3, the incremental-gradient algo- 
rithm can be viewed as an extreme sampling strategy that at each iteration uses 
only a single function <j>i (i.e., a sample of size s = 1) in order to form a sample 
average <j) s of the gradient. For the data-averaging case of §1.1, this corresponds to 
generating the approximation <p w from a single weighted average of the data (i.e., 
using a single random vector Wi to form R(x)wì). 

Bertsekas and Tsitsiklis [5, Prop. 3.8] describe conditions for convergence of 
the incremental-gradient algorithm for functions with globally Lipschitz continuous 
gradients, when the steplengths a k — > as specifìed by (4.3). Note that it is 
necessary for the steplengths a k — > in order for the iterates x k produced by (4.5) 
to ensure stationarity of the limit points. Unless we assume that V</>(x) = implies 
that V</>i(x) = for ali i, a stationary point of cf> is not a fixed point of the iteration 
process; Solodov [36] and Tseng [39] study this case. Solodov [36] further describes 



E[\\x k -x*\\]=0(l/k). 



(4.4) 



Xk+i = x k - a k V(f> ik (x k ), 



(4.5) 
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how bounding the steplengths away from zero yields limit points x that satisfy the 
approximate stationarity condition 

\\V<t>(x)\\=0(Ma k ). 

k 

With the additional assumption of strong convexity of <j>, it follows from Nedic 
and Bertsekas [28] that the randomized incremental-gradient algorithm with a 
decreasing stepsize a k = G(l/k) converges sublinearly accordingly to (4.4). They 
also show that keeping the stepsize Constant as a k = m/L implies that 

n\\x k - x*fì < 0([1 - n/L] k ) + 0(m/L). 

This expression is interesting because the first terni on the right-hand side decreases 
at a linear rate, and depends on the condition number fi/L of <j>\ this term is present 
for any deterministic first-order method with Constant stepsize. Thus, we can see 
that with the strong-convexity assumption and a Constant stepsize, the incremental- 
gradient algorithm has the same convergence characteristics as steepest descent, 
but with an additional Constant error term. 

4.3 Sampling methods 

The incremental-gradient method described in §4.2 has the benefit that each 
iteration costs essentially the same as evaluating only a single gradient element 
V0i. The downside is that they achieve only a sublinear convergence to the exact 
solution, or a linear convergence to an approximate solution. The sampling approach 
described in Friedlander and Schmidt [14] allows us to interpolate between the 
one-at-a-time incremental-gradient method at one extreme, and a full gradient 
method at the other. 

The sampling method is based on the iteration update 

Xk+i = x k - ag k , a = l/L, (4.6) 

where L is the Lipschitz Constant for the gradient, and the search direction 

g k = V(f>(x k ) + e k (4.7) 

is an approximation of the gradient; the term e k absorbes the discrepancy between 
the approximation and the true gradient. We defìne the direction g k in terms of 
the sample average gradient (3.1), and then e k corresponds to the error defined 
in (3.3). 

When the function <f> is strongly convex and has a globally Lipschitz continuous 
gradient, than the following theorem links the convergence of the iterates to the 
error in the gradient. 



Theorem 4.1. Suppose that K[\\e k \\ 2 } < B k , where lim^^oo B k+ i/B k < 1. Then 
each iteration of algorithm (4.6) satisfies for each k = 0,1,2, ... , 

E[||i fc - z*|| 2 ] < 0([1 - M/i]") + 0(C fc ), (4-8) 
where C k = max{_Bfc, (1 — fj,/L + e) k } for any positive e. 
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Fig. 3: Comparing the difference between the theoretical errors bounds in the sample 
averages for three sampling strategies (randomized with replacement, randomized 
without replacement, and deterministic) . (a) Sample sizes, as fractions of the total 
population m = 1000, required to reduce the error linearly with error Constant 0.9. 
(b) The corresponding cumulative number of samples used. See bounds (4.10). 



It is also possible to replace gk in (4.6) with a search direction pk that is the 
solution of the system 

H k p = g k , (4.9) 

for any sequence of Hessian approximations H k that are uniformly positive definite 
and bounded in norm, as can be enforced in practice. Theorem 4.1 continues to 
hold in this case, but with different constants fi and L that reflect the conditioning 
of the "preconditioned" function; see [14, §1.2]. 

It is useful to compare (4.4) and (4.8), which are remarkably similar. The 
distance to the solution, for both the incremental-gradient method (4.5) and the 
gradient-with-errors method (4.6), is bounded by the same linearly convergent term. 
The second terms in their bounds, however, are crucially different: the accuracy of 
the incremental-gradient method is bounded by a multiple of the fixed steplength; 
the accuracy of the gradient-with-errors method is bounded by the norm of the 
error in the gradient. 

Theorem 4.1 is significant because it furnishes a guide for refining the sample 
Sk that defines the average approximation 

9k = — ^2 4>i( x *>) 

of the gradient of <f>, where Sk is the size of the sample Sk; cf. (3.1). In particular, (3.6) 
and (3.7) give the second moment of the errors of these sample averages, which 
correspond precisely to the gradient error defined by (4.7). If we wish to design a 
sampling strategy that gives a linear decrease with a certain rate, then a policy for 
the sample size Sk needs to ensure that it grows fast enough to induce E[||efc|| 2 ] to 
decrease with at least that rate. Also, from (4.8), it is clear that there is no benefit 
in increasing the sample size at a rate faster than the underlying "pure" first-order 
method without gradient error. If, for example, the function is poorly conditioned — 
i.e., fi/L is small — than the sample-size increase should be commensurately slow. 
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It is instructive to compare how the sample average error decreases in the 
randomized (with and without replacement) and deterministic cases. We can more 
easily compare the randomized and deterministic variants by following Bertsekas 
and Tsitsiklis [5, §4.2], and assuming that 



||V^(a;)|| 2 < Pi + /? 2 ||V0(V)|| 2 for ali x and i 



1. 



for some constants /3i > and Pi > 1. Together with the Lipschitz continuity of < 
we can provide the following bounds: 



2 1 

randomized, without replacement E[||efc|| ] < — 

Sk 

. 1 



rn 



randomized, with replacement IE[|| | 
deterministic \\ek\ 



< 4 



m — s k 
ni 



Pk (4.10a) 
(4.10b) 

(4.10c) 



where Pk = Pi + iPiV^ixk) — 4>(x*)]. These bounds follow readily from the 
derivation in [14, §§3.1-3.2]. Figure 3 illustrates the difference between these 
bounds on an example problem with m = 1000. The panel on the left shows how 
the sample size needs to be increased in order for the right-hand-side bounds 
in (4.10) to decrease linearly at a rate of 0.9. The panel on the right shows the 
cumulative sample size, Le., Y^Ì=o Si ' Uniform sampling without replacement yields 
a uniformly and significantly better bound than the other sampling strategies. Both 
types of sampling are admissible, but sampling without replacement requires a 
much slower rate of growth of s to guarantee a linear rate. 

The strong convexity assumption needed to derive the error bounds used in 
this section is especially strong because the inverse problem we use to motivate 
the sampling approach is not a convex problem. In fact, it is virtually impossible 
to guarantee convexity of a composite function such as (2.1) unless the penalty 
function p(-) is convex and each n(-) is affine. This is not the case for many 
interesting inverse problems, such as full waveform inversion, and for nonconvex 
loss functions corresponding to distributions with heavy tails, such as Student's t. 

Even relaxing the assumption on (f> from strong convexity to just convexity 
makes it difficult to design a sampling strategy with a certain convergence rate. 
The full-gradient method for convex (but not strongly) functions has a sublinear 
convergence rate of C(l/fc). Thus, ali that is possible for a sampling-type approach 
that introduces errors into the gradient is to simply maintain that sublinear rate. 
For example, if ||efc|| 2 < Bk, and X^fcLi Bk < oo, then the iteration (4.6) maintains 
the sublinear rate of the gradient method [14, Theorem 2.6]. The theory for the 
strongly convex case is also supported by empirical evidence, where sampling 
strategies tend to outperform basic incremental-gradient methods. 



5 Numerical experiments in seismic inversion 

A good candidate for the sampling approach we have discussed is the full waveform 
inversion problem from exploration geophysics, which we address using a robust 
formulation. The goal is to obtain an estimate of subsurface properties of the 
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earth using seismic data. To collect the data, explosive charges are detonated just 
below the surface, and the energy that reflects back is recorded at the surface by a 
large array of geophones. The resulting data consist of a time-series collection for 
thousands of source positions. 

The estimate of the medium parameters is based on fitting the recorded and 
predicted data. Typically, the predicted data are generated by solving a PDE whose 
coefficients are the features of interest. The resulting PDE-constrained optimization 
problem can be formulated in either the time [38] or the frequency [32] domain. 
It is common practice to use a simple scalar wave equation to predict the data, 
effectively assuming that the earth behaves like a fluid — in this case, sound speed 
is the parameter we seek. 

Raw data are processed to remove any unwanted artifacts; this requires signifi- 
cant time and effort. One source of unwanted artifacts in the data is equipment 
malfunction. If some of the receivers are not working properly, the resulting data 
can be either zero or contaminated with an unusual amount of noise. And even if 
we were to have a perfect estimate of the sound speed, we stili would not expect to 
be able to fit our model perfectly to the data. The presence of these outliers in the 
data motivates us (and many other authors, e.g., [7,8, 16]) to use robust methods 
for this application. We compare the results of robust Student's t-based inversion 
to those obtained using least-squares and Huber robust penalties, and we compare 
the performance of deterministic, incremental-gradient, and sampling methods in 
this setting. 

5.1 Modelling and gradient computation for full waveform inversion 

The forward model for frequency-domain acoustic FWI, for a single source function 
q, assumes that wave propagation in the earth is described by the scalar Helmholtz 
equation 

Au:(x)u = [uj 2 x + V 2 ]u = q, 
where oj is the angular frequency, x is the squared-slowness (seconds/meter) 2 , and 
u represents the wavefield. The discretization of the Helmholtz operator includes 
absorbing boundary conditions, so that A u> (x) and u are complex-valued. The data 
are measurements of the wavefield obtained at the receiver locations d = Pu. The 
forward modelling operator F(x) is then given by 

F(x) = PA~ 1 {x), 

where A is a sparse block-diagonal matrix, with blocks A u indexed by the frequencies 
ùj. Multiple sources qi are typically modeled as discretized delta functions with a 
frequency-dependent weight. The resulting data are then modeled by the equation 
di = F(x)qi, and the corresponding residuai equals ri(x) = di — F{x)qi (cf. (1.3)). 
For a given loss function p, the misfit function and its gradient are defined as 

m m 

<j>{x) = Y;p(n{x)) and V^)=^VF(i,9 1 ) , Vp(r- 1 (i)), 

i=l i=l 

where \7F(x, qt) is the Jacobian of F(x)qi. The action of the adjoint of the Jacobian 
on a vector y can be efficiently computed via the adjoint-state method [38] as 
follows: 

VF(x,qi)*y = G(x,u z )*Vi, 
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where G{x, iti) is the (sparse) Jacobian of A[x)uì with respect to x, and u% and Vi 
are solutions of the linear systems 

A(x)m = qi and A(x)*Vi = Py. 

The Huber penalty function for a vector r is 



P( r ) = ^2d> where C* = | 



rt/2fi if|n|</x 
\vì\ — /i/2 otherwise. 



The Student's t penalty function (2.4) for a vector r is defined by 

p(r) = ^log(l + r?/z/). 



5.2 Experimental setup and results 

For the seismic velocity model x* e R 60501 on a 201-by-301 grid depicted in 
Figure 2(a), observed data d (a complex-valued vector of length 272,706) are 
generated using 6 frequencies, 151 point sources, and 301 receivers located at the 
surface. To simulate a scenario in which half of the receivers at unknown locations 
have failed, we multiply the data with a mask that zeroes out 50% of the data at 
random locations. We emphasize that the model was blind to this corruption, and 
so we could have equivalently added a large perturbation to the data, as was done 
for example in [3]. The resulting data thus differ from the prediction F(x*) given 
by the true solution x* . A spike in the histogram of the residuals ri(x*) evaluated 
at the true solution x* , shown in Figure 2(a), shows these outliers. The noise does 
not fit well with any simple prior distribution that one might like to use. We solve 
the resulting optimization problem with the least-squares, Huber, and Student t- 
penalties using a limited-memory BFGS method. Figure 4 tracks across iterations 
the relative model error \\xk — x*\\/\\x*\\ for ali three approaches. Histograms of the 
residuals after 50 iterations are plotted in Figures 2(c)-(e). The residuals for the 
least-squares and Huber approaches resemble Gaussian and Laplace distributions 
respectively. This fits well with the prior assumption on the noise, but does not 
fit the true residuai at ali. The residuai for the Student's t approach does not 
resemble the prior distribution at ali. The slowly increasing penalty function allows 
for enough freedom to let the residuai evolve into the true distribution. 

Next, we compare the performance of the incremental-gradient (§4.2) and 
sampling (§4.3) algorithms against the full-gradient method. For the incremental- 
gradient algorithm (4.5), at each iteration we randomly choose i uniformly over the 
set { 1, 2, . . . , m }, and use either a fixed stepsize a.k = a or a decreasing stepsize 
Qfc = a/\_k/m\. The sampling method is implemented via the iteration 

Xk + l = X k - CXkPk, 

where pk satisfies (4.9), and Hk is a limited-memory BFGS Hessian approximation. 
The quasi-Newton Hessian Hk is updated using the pairs (Axk, Agk), where 

Ax k := x k +i - x k and Ag k := gk+i - 9k; 
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°' 6 10 20 30 40 50 

iteration 

Fig. 4: Relative error between the true and reconstructed models for least-squares, 
Huber, and Student t penalties. In the least-squares case, the model error is not 
reduced at ali. Slightly better results are obtained with the Huber penalty, although 
the model error starts to increase after about 20 iterations. The Students t penalty 
gives the best result. 




Fig. 5: (a) Convergence of different optimization strategies on the Students t penalty: 
Limited-memory BFGS using the full gradient ("full"), incrementai gradient with 
Constant and decreasing step sizes, and the sampling approach. Different lines of 
the same color indicate independent runs with different random number streams. 
(b) The evolution of the amount of data used by the sampling method. 



the limited-memory Hessian is based on a history of length 4. Nocedal and Wright 
[31, §7.2] describe the recursive procedure for updating H k . The batch size is 
increased at each iteration by only a single element, Le., 



s k+1 = min{ m, s k + 1 }. 



The members of the batch are redrawn at every iteration, and we use an Armijo 
backtracking linesearch based on the sampled function (l/ s <0 X^gs fiifa)- 

The convergence plots for several runs of the sampling method and the stochastic 
gradient method with a = IO -6 are shown in Figure 5(a). Figure 5(b) plots the 
evolution of the amounts of data sampled. 
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6 Discussion and conclusions 

The numerical experiments we have conducted using the Student's t-penalty are 
encouraging, and indicate that this approach can overcome some of the limitations of 
convex robust penalties such as the Huber norm. Unlike the least-squares and Huber 
penalties, the Student t-penalty does not force the residuai into a shape prescribed 
by the corresponding distribution. The sampling method successfully combines the 
steady convergence rate of the full-gradient method with the inexpensive iterations 
provided by the incremental-gradient method. 

The convergence analysis of the sampling method, based on Theorem 4.1, relies 
on bounding the second moment of the error in the gradient, and hence the variance 
of the sample average (see (3.4)). The bound on the second-moment arises because 
of our reliance on the concept of an expected distance to optimality E[||xfc — £*|| 2 ]- 
However, other probabilistic measures of distance to optimality may be more 
appropriate; this would infiuence our criteria for bounding the error in the gradient. 
For example, Avron and Toledo [4] measure the quality of a sample average using 
an "epsilon-delta" argument that provides a bound on the sample size needed to 
achieve a particular accuracy e with probability 1 — 5. 

Other refinements are possible. For example, van den Doel and Ascher [40] 
advocate an adaptive approach for increasing the sample size, and Byrd et al. [10] 
use a sample average-approximation of the Hessian. 
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